Remote direct memory access (RDMA)-based recovery of dirty data in remote memory

ABSTRACT

Techniques for implementing RDMA-based recovery of dirty data in remote memory are provided. In one set of embodiments, upon occurrence of a failure at a first (i.e., source) host system, a second (i.e., failover) host system can allocate a new memory region corresponding to a memory region of the source host system and retrieve a baseline copy of the memory region from a storage backend shared by the source and failover host systems. The failover host system can further populate the new memory region with the baseline copy and retrieve one or more dirty page lists for the memory region from the source host system via RDMA, where the one or more dirty page lists identify memory pages in the memory region that include data updates not present in the baseline copy. For each memory page identified in the one or more dirty page lists, the failover host system can then copy the content of that memory page from the memory region of the source host system to the new memory region via RDMA.

CROSS-REFERENCES TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 17/876,395, filed Jul. 28, 2022 and entitled “Remote Direct Memory Access (RDMA)-Based Recovery of Dirty Data in Remote Memory,” which is a continuation of U.S. patent application Ser. No. 17/321,673, filed May 17, 2021 and entitled “Remote Direct Memory Access (RDMA)-Based Recovery of Dirty Data in Remote Memory,” the contents of both of which are incorporated herein by reference in their entireties for all purposes.

BACKGROUND

Unless otherwise indicated, the subject matter described in this section is not prior art to the claims of the present application and is not admitted as being prior art by inclusion in this section.

In business-critical computing environments, maintaining high availability (HA) of the workloads running in the environments is a key goal. Without HA, such environments are vulnerable to failure events (e.g., power outages, hardware failures, software failures, etc.) that can render their workloads unavailable, resulting in service interruptions and consequent losses in productivity, revenue, and/or business reputation.

According to one HA approach, a workload running on a first (i.e., “source”) host system in a computing environment can have its in-memory data flushed on a periodic basis from the source host system's physical memory (e.g., volatile dynamic random-access memory (DRAM) modules, non-volatile DIMMs (NVDIMMs), etc.) to a shared storage backend. If a failure occurs at the source host system, any remaining dirty data in the source host system's physical memory written by the workload since the last periodic flush can be synchronized to the shared storage backend. A second (i.e., “failover”) host system in the computing environment can then recover the data from the shared storage backend, thereby allowing the workload to resume execution on that failover host system while the source host system is taken offline for maintenance.

However, a significant issue with this HA approach is that it assumes the operating system (OS) or hypervisor running on the source host system is in a sufficiently operational state after the failure to sync the workload's remaining dirty data to the shared storage backend. This assumption will generally be valid if the failure is an AC power outage—in which case a backup power source such as an on-board battery or uninterruptible power supply (UPS) can provide power to the source host system for a short period of time while the OS/hypervisor completes the dirty data synchronization—or a non-critical error. But this assumption will not be valid if the failure is caused by an unrecoverable error in the OS/hypervisor's kernel (sometimes referred to as a kernel panic).

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an example computing environment.

FIG. 2 depicts a modified version of the computing environment of FIG. 1 that implements the techniques of the present disclosure.

FIG. 3 depicts an RDMA setup workflow that may be executed by a source host system and a failover host system according to certain embodiments.

FIG. 4 depicts an RDMA-based recovery workflow that may be executed by a failover host system according to certain embodiments.

DETAILED DESCRIPTION

In the following description, for purposes of explanation, numerous examples and details are set forth in order to provide an understanding of various embodiments. It will be evident, however, to one skilled in the art that certain embodiments can be practiced without some of these details or can be practiced with modifications or equivalents thereof.

1. Overview

The present disclosure is directed to a novel HA approach that leverages remote direct memory access (RDMA) to recover, by a failover host system in a computing environment, dirty data maintained in a physical memory of a source host system in the computing environment at the time of a failure at the source host system. As known in the art, RDMA is a technology implemented at the network interface controller (NIC) level that enables data to be transferred between the physical memories of two networked computer systems without any involvement by the central processing units (CPUs) or OSs/hypervisors on either side.

Unlike other HA approaches that require the OS/hypervisor at the source host system to be mostly intact/operational post-failure (in order to flush the dirty data to some destination such as a shared storage backend), the RDMA-based approach of the present disclosure is not limited by this requirement. Accordingly, this RDMA-based approach can be employed in scenarios where the source host system has failed due to an unrecoverable OS/hypervisor kernel error, which is a relatively common occurrence in large-scale computing environments.

2. Example Computing Environment and Solution Architecture

FIG. 1 is a simplified block diagram of an example computing environment 100 in which the techniques of the present disclosure may be implemented. As shown, computing environment 100 includes at least two physical computer systems (i.e., a source host system 102 and a failover host system 104) that are communicatively coupled with each other and with a shared storage backend 106 via a network 108. In one set of embodiments, shared storage backend 106 may be a standalone storage server/appliance (or group of such servers/appliances), such as a storage array. In another set of embodiments, shared storage backend 106 may represent a logical aggregation of storage resources that are part of (i.e., local to) host systems 102 and 104, such as the virtual storage pool of a hyperconverged infrastructure (HCI) cluster.

Each host system 102/104 includes, in software, an OS or hypervisor 110/112 that provides an environment in which user workloads (e.g., applications, virtual machines (VMs), containers, etc.) can run. For example, source host system 102 includes a VM 114 running on its OS/hypervisor 110.

In addition, each host system 102/104 includes, in hardware, one or more physical memory modules 116/118 that provide a byte-addressable memory store for the host system's workloads and a NIC 120/122 that enables communication between the host system and other entities over network 108. NICs 120 and 122 are RDMA capable and thus can transfer data directly between the physical memory modules of their respective host systems via an RDMA-enabled network protocol (e.g., InfiniBand, RDMA over Converged Ethernet (RoCE), or Internet Wide Area RDMA Protocol (iWARP)), without involving OSs/hypervisors 110 and 112.

In FIG. 1, it is assumed that OS/hypervisor 110 of source host system 102 is configured to allocate one or more portions of physical memory module(s) 116 as a persistent memory region 124 and expose persistent memory region 124 in the form of a virtual persistent memory module (or in other words, virtual NVDIMM) 126 to VM 114. As known in the art, persistent memory is a type of computer memory that is byte-addressable like conventional volatile DRAM but is non-volatile in nature like conventional storage. Through this mechanism, the guest processes of VM 114 can access the portion(s) of physical memory module(s) 116 backing virtual NVDIMM 126/persistent memory region 124 using persistent memory semantics (i.e., via byte-addressable memory I/O and with the expectation that any data written there will persist across power cycles).

In addition, it is assumed that some component within the software stack of source host system 102 is configured to periodically flush, to shared storage backend 106, the data written by VM 114 to virtual NVDIMM 126/persistent memory region 124. In the case of a failure at source host system 102 that prevents VM 114 from continuing to run there, this periodic flushing allows failover host system 104 to retrieve the current state of persistent memory region 124 from shared storage backend 106, reconstruct this region in its physical memory modules 118, and resume execution of VM 114 (or more precisely, a migrated copy of VM 114) using the reconstructed persistent memory region. This periodic flushing also enables the data contents of persistent memory region 124 to be persisted across power cycles of source host system 102 in the scenario where the physical memory modules backing persistent memory region 124 are volatile DRAM modules (rather than actual NVDIMMs).
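
The disclosure does not prescribe a particular flushing mechanism. Purely as an illustration, one flush cycle might look like the following C sketch, in which the dirty-list layout, the dirty list itself, and the backend file descriptor are hypothetical names rather than elements of the patent:

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Hypothetical dirty-list entry: which page of the persistent memory
 * region was written since the last flush. */
struct dirty_entry {
    uint64_t region_offset;   /* byte offset of the page within the region */
};

/* One flush cycle: write each dirty page of the in-memory region to the
 * backend copy of the region, then make the baseline copy durable. */
int flush_dirty_pages(const void *region, const struct dirty_entry *list,
                      size_t count, int backend_fd) {
    for (size_t i = 0; i < count; i++) {
        const char *src = (const char *)region + list[i].region_offset;
        if (pwrite(backend_fd, src, PAGE_SIZE,
                   (off_t)list[i].region_offset) != PAGE_SIZE)
            return -1;
    }
    return fsync(backend_fd);   /* the caller then clears the dirty list */
}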

As indicated in the Background section, one complication with periodically flushing persistent memory region 124 to shared storage backend 106 is that, at the time a failure occurs at source host system 102, there may be some remaining dirty data in persistent memory region 124 that has not been flushed yet (due to being written by VM 114 after the last flush operation). This remaining dirty data must be recovered in some way in order for persistent memory region 124 to be correctly reconstructed on failover host system 104 and for VM 114 to be resumed there. One approach is to employ a post-fail agent in source host system 102 that identifies and synchronizes the remaining dirty data to shared storage backend 106 after the failure has occurred. However, if the failure causes the kernel of OS/hypervisor 110 to crash, the post-fail agent cannot be trusted to correctly carry out its duties and thus this approach cannot be reliably used.

To address the foregoing and other similar issues, FIG. 2 depicts a modified version of computing environment 100 of FIG. 1 (i.e., environment 200) that includes a novel RDMA setup agent 202/204 in each host system 102/104 and a novel RDMA-based recovery agent 206 in failover host system 104 according to embodiments of the present disclosure. At a high level, RDMA setup agents 202 and 204 can carry out a setup workflow that involves (1) establishing an RDMA connection between host systems 102 and 104 via respective RDMA-capable NICs 120 and 122, and (2) granting failover host system 104/NIC 122 RDMA access to one or more lists of memory pages in persistent memory region 124 that are dirtied by VM 114 but not yet flushed to shared storage backend 106, as well as to the in-memory data contents of those dirty memory pages.

Further, at the time of a failure at source host system 102 and consequent migration of VM 114 from source host system 102 to failover host system 104, RDMA-based recovery agent 206 can carry out a recovery workflow that involves, inter alia: (1) retrieving a baseline copy of persistent memory region 124 from shared storage backend 106 into a newly-created persistent memory region R on failover host system 104; (2) reading, via the RDMA connection created during the setup workflow, the one or more lists of dirty memory pages for persistent memory region 124 from source host system 102; (3) for each dirty memory page P in the one or more lists, copying, via the RDMA connection, the data contents of P from source host system 102 to an appropriate offset of R; and (4) mapping R to the migrated version of VM 114.

With this general approach, persistent memory region 124 can be fully reconstructed on failover host system 104 in response to a failure at source host system 102, which in turn allows VM 114 to resume execution on failover host system 104 while source host system 102 is repaired or replaced. This is true even if source-side OS/hypervisor 110 is rendered unstable or inoperable by the failure, because the transfer of data via RDMA (per steps (2) and (3) of the recovery workflow above) does not require any involvement by OS/hypervisor 110. Accordingly, unlike other HA approaches, the techniques of the present disclosure provide an HA solution that is robust against a wide variety of commonly occurring failure types/modes, including those that arise out of an unrecoverable OS/hypervisor kernel error.

The remaining sections of this disclosure provide additional details regarding the setup and recovery workflows performed by RDMA setup agents 202, 204 and RDMA-based recovery agent 206, as well as certain modifications to these workflows to support cases in which source host system 102 uses a CPU hardware feature known as Page Modification Logging (PML) to facilitate the tracking of dirtied memory pages. It should be appreciated that the computing environment and solution architecture shown in FIG. 2 are illustrative and not intended to limit embodiments of the present disclosure. For example, although FIG. 2 depicts a single VM and a single virtual NVDIMM/persistent memory region for that VM on source host system 102, in other embodiments there may be multiple VMs running on source host system 102, each with one or more virtual NVDIMMs mapped to corresponding persistent memory regions in OS/hypervisor 110. In these embodiments, the RDMA-based HA approach of the present disclosure may be applied to recover dirty data in each persistent memory region of each such VM.

In addition, while FIG. 2 and the foregoing description assume that the dirty data being recovered from source host system 102 is data in a persistent memory region, the RDMA-based approach of the present disclosure is not limited to the recovery of persistent memory. Instead, this approach can be broadly applied to recover dirty data in any type of memory region (e.g., volatile memory, persistent memory, etc.) of a source host system in response to a failure at that system. One of ordinary skill in the art will recognize other variations, modifications, and alternatives.

3. RDMA Setup Workflow

FIG. 3 depicts a flowchart 300 of the setup workflow that may be executed by RDMA setup agents 202 and 204 of host systems 102 and 104 according to certain embodiments.

Starting with block 302, RDMA setup agents 202 and 204 can establish an RDMA connection between host systems 102 and 104 via their respective RDMA-capable NICs 120 and 122. Although the details of this process are beyond the scope of the present disclosure, it generally involves creating a “queue pair” on each host system (comprising RDMA send and receive queues) and exchanging information regarding these queue pairs, as well as authentication security keys.
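
Those details typically resemble the following libibverbs sketch, which creates a reliably-connected (RC) queue pair capable of issuing RDMA reads. This is an illustration only, not part of the disclosure; device discovery, the INIT/RTR/RTS state transitions, and the out-of-band exchange of queue pair numbers and keys are omitted, and the queue depths are arbitrary.

```c
#include <infiniband/verbs.h>

/* Create an RC queue pair suitable for one-sided RDMA reads (sketch;
 * error handling abbreviated, later state transitions omitted). */
struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq) {
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,        /* reliable connection */
        .cap = {
            .max_send_wr  = 64,       /* outstanding RDMA read requests */
            .max_recv_wr  = 1,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);
}
```

An RC queue pair is used here because RDMA read operations require reliable, connected transport semantics.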

Once the RDMA connection has been established, RDMA setup agent 204 of failover host system 104 can transmit, to RDMA setup agent 202 of source host system 102, one or more requests to register portions of physical memory on source host system 102 that hold (A) the data contents of persistent memory region 124, (B) one or more lists of dirty memory pages in persistent memory region 124 (in other words, memory pages that are written by VM 114 but not yet flushed to shared storage backend 106), and (C) associated metadata for persistent memory region 124 (block 304). With regard to (B) (i.e., the one or more dirty memory page lists), each entry in each list can include the machine page number (MPN) of the dirty memory page in physical memory module(s) 116 of source host system 102 and a logical offset for that page in persistent memory region 124. In embodiments where source host system 102 uses PML to track dirty memory pages, the one or more dirty memory page lists can specifically include two lists: a first list of dirty memory pages maintained in a “PML memory” and a second list of dirty memory pages maintained in a “dirty drain buffer” (described in further detail in section (5) below).
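
As one concrete illustration of (B), a dirty page list entry could be laid out as below, and the corresponding memory portions registered for remote read access via libibverbs. The struct layout and function name are assumptions for illustration, not taken from the disclosure; only the MPN-plus-offset content of each entry comes from the text above.

```c
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Hypothetical dirty page list entry: the disclosure specifies an MPN
 * plus a logical offset into the persistent memory region per entry. */
struct dirty_page_entry {
    uint64_t mpn;             /* machine page number on the source host */
    uint64_t region_offset;   /* logical offset within the memory region */
};

/* Register a source-side memory portion so the failover NIC may read it.
 * IBV_ACCESS_REMOTE_READ is what allows one-sided RDMA reads to proceed
 * without involving the source CPU or OS/hypervisor after a failure. */
struct ibv_mr *register_for_remote_read(struct ibv_pd *pd, void *addr,
                                        size_t len) {
    return ibv_reg_mr(pd, addr, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}
```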

With regard to (C) (i.e., associated metadata for persistent memory region 124), this metadata can include, among other things, a mapping between persistent memory region 124 and virtual NVDIMM 126 of VM 114.

In response to the request(s) sent at block 304, RDMA setup agent 202 of source host system 102 can identify the portions of physical memory containing (A), (B), and (C) (block 306), register each of these portions as an RDMA region in source-side NIC 120 (which enables failover-side NIC 122 to access these regions at the time of recovery) (block 308), and transmit the starting memory address and size of each registered memory portion to RDMA setup agent 204 of failover host system 104 (block 310). Alternatively, rather than sending the starting memory address and size of each memory portion separately, RDMA setup agent 202 can send the starting memory address and size of a “superblock” within the physical memory of source host system 102 that holds the starting memory addresses and sizes of the registered memory portions. In this scenario, the memory location of the superblock itself will also be registered as an RDMA region on source-side NIC 120.
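
One possible layout for such a superblock is sketched below; the field and type names are hypothetical, and the inclusion of remote keys (rkeys) alongside addresses and sizes is an assumption based on what an RDMA read would need in practice.

```c
#include <stdint.h>

/* Hypothetical descriptor for one registered memory portion. */
struct rdma_region_desc {
    uint64_t addr;   /* starting memory address of the registered portion */
    uint64_t size;   /* size in bytes */
    uint32_t rkey;   /* remote key produced by registration */
};

/* Hypothetical superblock: the only address/size pair sent to the
 * failover host; it in turn describes all other registered portions. */
struct recovery_superblock {
    struct rdma_region_desc region_data;       /* (A) region contents */
    struct rdma_region_desc dirty_lists[2];    /* (B) e.g., PML + drain */
    struct rdma_region_desc region_metadata;   /* (C) associated metadata */
};
```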

Finally, at block 312, RDMA setup agent 204 can receive and save the information transmitted at block 310 for later use by RDMA-based recovery agent 206.

4. Recovery Workflow

FIG. 4 depicts a flowchart 400 of the recovery workflow that may be executed by RDMA-based recovery agent 206 of failover host system 104 in response to a failure at source host system 102 according to certain embodiments. Flowchart 400 assumes that failover host system 104 has been selected by, e.g., an HA management component within computing environment 200 as the failover target for persistent memory region 124/VM 114. Flowchart 400 also assumes that VM 114 has been migrated from source host system 102 to failover host system 104 (resulting in a migrated VM M at failover host system 104).

Starting with block 402, OS/hypervisor 112 of failover host system 104 can detect that a virtual NVDIMM (i.e., virtual NVDIMM 126 shown in FIGS. 1 and 2) exists in the virtual machine configuration file for migrated VM M and can invoke RDMA-based recovery agent 206.

At block 404, RDMA-based recovery agent 206 can allocate a new persistent memory region R in physical memory module(s) 118 of failover host system 104 that is equal in size to the virtual NVDIMM detected at block 402. RDMA-based recovery agent 206 can then retrieve a “baseline” copy of persistent memory region 124 (i.e., a point-in-time copy of persistent memory region 124 as of the last periodic flush at source host system 102) from shared storage backend 106 and populate the retrieved copy into persistent memory region R (block 406).
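
For illustration, if the baseline copy were exposed as a file on the shared storage backend, populating region R could be sketched as follows. The file path, the caller-supplied allocation of R, and the error policy are assumptions, not part of the disclosure.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Fill the newly allocated region R with the baseline copy retrieved
 * from the shared storage backend (sketch). */
int populate_baseline(void *region_r, size_t size, const char *baseline_path) {
    int fd = open(baseline_path, O_RDONLY);
    if (fd < 0)
        return -1;
    size_t off = 0;
    while (off < size) {
        ssize_t n = pread(fd, (char *)region_r + off, size - off, (off_t)off);
        if (n <= 0) {                 /* short file or I/O error */
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }
    close(fd);
    return 0;
}
```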

Upon populating persistent memory region R with the baseline copy of persistent memory region 124 from shared storage backend 106, RDMA-based recovery agent 206 can begin the process of copying over the remaining dirty data for persistent memory region 124 from source host system 102 using RDMA (and in particular, via the RDMA connection established in the setup workflow). For example, at block 408, RDMA-based recovery agent 206 can allocate a memory buffer in physical memory module(s) 118 and can issue, via NIC 122, one or more RDMA read requests to source-side NIC 120 for the one or more dirty memory page lists previously registered at block 308 of flowchart 300. These requests, which can include the starting memory addresses and sizes of the source-side memory regions holding the lists, can cause NIC 120 of source host system 102 to retrieve the list(s) from physical memory module(s) 116 and send them to NIC 122 of failover host system 104, which can receive and write the list(s) to the memory buffer allocated at block 408 (block 410).
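
These one-sided reads correspond to RDMA read work requests. A minimal libibverbs sketch follows, assuming a connected queue pair in the RTS state and a locally registered buffer; the helper name is hypothetical, and a production implementation would pipeline requests rather than busy-poll one at a time.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Issue one RDMA read of a remote registered region into a local
 * registered buffer and wait for its completion (sketch). */
int rdma_read_region(struct ibv_qp *qp, struct ibv_cq *cq,
                     void *local_buf, uint32_t lkey,
                     uint64_t remote_addr, uint32_t rkey, uint32_t len) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;  /* no remote CPU involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)  /* busy-poll completion */
        ;
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```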

At block 412, RDMA-based recovery agent 206 can parse the dirty memory page list(s) in the memory buffer and process each entry E in the list(s) (either sequentially or in parallel) via a loop beginning at block 414. Within this loop, RDMA-based recovery agent 206 can read entry E, which can include the source-side MPN for the dirty memory page corresponding to E and the logical offset of this memory page in persistent memory region 124. RDMA-based recovery agent 206 can then issue, via NIC 122, an RDMA read request directed to the MPN and identifying the logical offset to NIC 120 of source host system 102 (block 416). This can cause NIC 120 to retrieve the data of that memory page from physical memory module(s) 116 and send it to NIC 122, which can receive and write the data at the specified logical offset within persistent memory region R, thereby copying the page's contents into R (block 418).
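
Using a read helper like the one above, the per-entry loop of blocks 414-418 might be sketched as follows. How an MPN translates to an address that is valid under the source-side registration is not spelled out in the disclosure, so mpn_to_remote_addr() is an assumption supplied elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

#define PAGE_SIZE 4096u

struct dirty_page_entry {          /* as in the setup-workflow sketch */
    uint64_t mpn;
    uint64_t region_offset;
};

/* From the previous sketch. */
int rdma_read_region(struct ibv_qp *qp, struct ibv_cq *cq,
                     void *local_buf, uint32_t lkey,
                     uint64_t remote_addr, uint32_t rkey, uint32_t len);

/* Assumption: maps an entry's MPN to a readable remote address. */
uint64_t mpn_to_remote_addr(uint64_t mpn);

/* Blocks 414-418: read each dirty page's contents from the source host
 * directly into region R at the page's logical offset. */
int copy_dirty_pages(struct ibv_qp *qp, struct ibv_cq *cq,
                     char *region_r, uint32_t region_r_lkey,
                     const struct dirty_page_entry *entries, size_t count,
                     uint32_t src_rkey) {
    for (size_t i = 0; i < count; i++) {
        if (rdma_read_region(qp, cq,
                             region_r + entries[i].region_offset,
                             region_r_lkey,
                             mpn_to_remote_addr(entries[i].mpn),
                             src_rkey, PAGE_SIZE))
            return -1;
    }
    return 0;
}
```

Because each read lands directly at the entry's logical offset in R, no intermediate staging buffer is needed for the page contents themselves.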

At block 420, RDMA-based recovery agent 206 can reach the end of the current loop iteration and return to block 414 in order to process the next dirty memory page entry. Once all of the entries have been processed, persistent memory region R on failover host system 104 will be fully consistent with persistent memory region 124 on source host system 102. Accordingly, RDMA-based recovery agent 206 can map persistent memory region R to the virtual NVDIMM of migrated VM M (block 422).

Finally, at block 424, OS/hypervisor 112 of failover host system 104 can power on migrated VM M and flowchart 400 can end.

It should be appreciated that flowchart 400 is illustrative and various modifications are possible. For example, as noted with respect to the setup workflow of FIG. 3, in some embodiments failover host system 104 may receive the starting memory address and size of a superblock which contains the starting addresses/sizes of the dirty memory page list(s) and the content of persistent memory region 124. In these embodiments, RDMA-based recovery agent 206 can first retrieve the data of the superblock via an RDMA read, parse the superblock data to identify the address and size information included therein, and then issue subsequent RDMA reads using that identified information.

Further, although not shown in FIG. 4, upon completing the reconstruction of persistent memory region 124 on failover host system 104 (in the form of persistent memory region R), RDMA-based recovery agent 206 can invoke RDMA setup agent 204 in order to carry out a new setup workflow with another host system in computing environment 200, thereby allowing that other host system to act as a new failover target for migrated VM M and persistent memory region R in the case where host system 104 experiences a failure.

5. Support for PML-Based Dirty Page Tracking

As mentioned previously, in certain embodiments source host system 102 may utilize a CPU hardware feature known as PML to facilitate the tracking of memory pages dirtied by VM 114 in persistent memory region 124. When PML is enabled, the CPU of source host system 102 automatically records the MPN of each memory page that is dirtied by VM 114 in an area of physical memory referred to as PML memory. This PML memory has a fixed size; accordingly, when the PML memory becomes full, a trap to OS/hypervisor 110 occurs and the OS/hypervisor moves the dirty memory pages identified in the PML memory to a separate, larger memory area referred to as a dirty drain buffer. An asynchronous process of OS/hypervisor 110 then periodically flushes the dirty memory pages identified in the dirty drain buffer to the copy of persistent memory region 124 in shared storage backend 106.
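
A rough sketch of this hand-off appears below. The 512-entry log size follows Intel's published PML design (one 4 KB page of 8-byte entries); the drain buffer structure, its capacity, and all names are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define PML_ENTRIES 512   /* Intel PML: one 4 KB page of 8-byte entries */

/* Hypothetical dirty drain buffer: a larger, software-managed list the
 * OS/hypervisor appends to whenever the hardware PML buffer fills. */
struct dirty_drain_buffer {
    uint64_t mpns[1 << 16];
    size_t   count;
};

/* Invoked on the PML-full trap: move the hardware-recorded MPNs into the
 * drain buffer so hardware logging can continue. */
void pml_drain(const uint64_t pml_log[PML_ENTRIES],
               struct dirty_drain_buffer *drain) {
    for (size_t i = 0; i < PML_ENTRIES; i++) {
        if (drain->count < sizeof(drain->mpns) / sizeof(drain->mpns[0]))
            drain->mpns[drain->count++] = pml_log[i];
    }
    /* An asynchronous flusher later writes these pages to the shared
     * storage backend and empties the buffer. */
}
```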

In embodiments where source host system 102 uses PML, it is not sufficient for RDMA-based recovery agent 206 to retrieve the list of dirty memory pages in the dirty drain buffer of source host system 102 and copy over the contents of those pages; recovery agent 206 should also retrieve the list of dirty memory pages in the PML memory for VM 114 and copy over the contents of the PML pages as well. Accordingly, in these embodiments, the setup workflow shown in FIG. 3 can be modified so that RDMA setup agent 202 of source host system 102 registers the portion of physical memory holding the PML memory of VM 114 and provides the starting memory address and size of this registered memory portion to RDMA setup agent 204 of failover host system 104 (either separately or via the superblock method).

Further, the recovery workflow shown in FIG. 4 can be modified so that RDMA-based recovery agent 206 retrieves the list of dirty memory pages from the PML memory on source host system 102 and copies over the content of each PML page via RDMA, after performing these steps for the dirty memory pages in the dirty drain buffer. This ensures that persistent memory region R of failover host system 104 will include all of the changes made to persistent memory region 124 of source host system 102 as recorded in the PML memory and the dirty drain buffer.
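
Expressed compactly with the hypothetical helpers from the earlier sketches, the PML-aware ordering is:

```c
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

struct dirty_page_entry;   /* as defined in the recovery-workflow sketch */

int copy_dirty_pages(struct ibv_qp *qp, struct ibv_cq *cq,
                     char *region_r, uint32_t region_r_lkey,
                     const struct dirty_page_entry *entries, size_t count,
                     uint32_t src_rkey);

/* Replay the (older) drain-buffer entries first, then the (newer)
 * entries still resident in the registered PML memory, so that the most
 * recent write to any page is the one that lands last in region R. */
int recover_all_dirty(struct ibv_qp *qp, struct ibv_cq *cq,
                      char *region_r, uint32_t lkey, uint32_t src_rkey,
                      const struct dirty_page_entry *drain_list, size_t n_drain,
                      const struct dirty_page_entry *pml_list, size_t n_pml) {
    if (copy_dirty_pages(qp, cq, region_r, lkey, drain_list, n_drain, src_rkey))
        return -1;
    return copy_dirty_pages(qp, cq, region_r, lkey, pml_list, n_pml, src_rkey);
}
```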

Certain embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. For example, these operations can require physical manipulation of physical quantities—usually, though not necessarily, these quantities take the form of electrical or magnetic signals, where they (or representations of them) are capable of being stored, transferred, combined, compared, or otherwise manipulated. Such manipulations are often referred to in terms such as producing, identifying, determining, comparing, etc. Any operations described herein that form part of one or more embodiments can be useful machine operations.

Yet further, one or more embodiments can relate to a device or an apparatus for performing the foregoing operations. The apparatus can be specially constructed for specific required purposes, or it can be a general-purpose computer system selectively activated or configured by program code stored in the computer system. In particular, various general-purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations. The various embodiments described herein can be practiced with other computer system configurations including handheld devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.

Yet further, one or more embodiments can be implemented as one or more computer programs or as one or more computer program modules embodied in one or more non-transitory computer readable storage media. The term non-transitory computer readable storage medium refers to any data storage device that can store data which can thereafter be input to a computer system. The non-transitory computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer system. Examples of non-transitory computer readable media include a hard drive, network attached storage (NAS), read-only memory, random-access memory, flash-based nonvolatile memory (e.g., a flash memory card or a solid-state disk), a CD (Compact Disc) (e.g., CD-ROM, CD-R, CD-RW, etc.), a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The non-transitory computer readable media can also be distributed over a network-coupled computer system so that the computer readable code is stored and executed in a distributed fashion.

In addition, while certain virtualization methods referenced herein have generally assumed that virtual machines present interfaces consistent with a particular hardware system, persons of ordinary skill in the art will recognize that the methods referenced can be used in conjunction with virtualizations that do not correspond directly to any particular hardware system. Virtualization systems in accordance with the various embodiments, implemented as hosted embodiments, non-hosted embodiments, or as embodiments that tend to blur distinctions between the two, are all envisioned. Furthermore, certain virtualization operations can be wholly or partially implemented in hardware.

Many variations, modifications, additions, and improvements are possible, regardless of the degree of virtualization. The virtualization software can therefore include components of a host, console, or guest operating system that performs virtualization functions. Plural instances can be provided for components, operations, or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations can be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component can be implemented as separate components.

As used in the description herein and throughout the claims that follow, “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.

The above description illustrates various embodiments along with examples of how aspects of particular embodiments may be implemented. These examples and embodiments should not be deemed to be the only embodiments and are presented to illustrate the flexibility and advantages of particular embodiments as defined by the following claims. Other arrangements, embodiments, implementations, and equivalents can be employed without departing from the scope hereof as defined by the claims.

What is claimed is:
 1. A method comprising, upon occurrence of a failure at a source host system: allocating, by a failover host system, a new memory region in a physical memory of the failover host system that corresponds to a memory region in a physical memory of the source host system; retrieving, by the failover host system, metadata pertaining to the memory region from the source host system via a remote direct memory access (RDMA) connection, wherein the metadata identifies one or more portions of the memory region, and wherein the RDMA connection was previously established between a network interface controller (NIC) of the failover host system and a NIC of the source host system before the failure; and for each of the one or more portions identified by the metadata, copying, by the failover host system, content of the portion from the memory region to the new memory region via the RDMA connection.
 2. The method of claim 1 further comprising, prior to the allocating: receiving a virtual machine (VM) migrated from the source host system in response to the failure.
 3. The method of claim 2 wherein the allocating comprises: detecting that a virtual persistent memory module exists in a configuration file of the VM, the virtual persistent memory module being mapped to the memory region; and allocating the new memory region to have a same size as the virtual persistent memory module.
 4. The method of claim 3 further comprising, after the copying: mapping the virtual persistent memory module to the new memory region.
 5. The method of claim 1 wherein the failure causes an operating system (OS) or hypervisor of the source host system to become inoperable.
 6. The method of claim 1 further comprising, prior to retrieving the metadata: retrieving a baseline copy of the memory region from a storage backend shared by the source host system and the failover host system, the baseline copy representing a copy of the memory region as captured via a periodic flushing operation to the storage backend prior to the failure; and populating the new memory region with the baseline copy.
 7. The method of claim 6 wherein the one or more portions of the memory region identified by the metadata include data updates absent in the baseline copy.
 8. A non-transitory computer readable storage medium having stored thereon program code executable by a failover host system, the program code embodying a method comprising, upon occurrence of a failure at a source host system: allocating a new memory region in a physical memory of the failover host system that corresponds to a memory region in a physical memory of the source host system; retrieving metadata pertaining to the memory region from the source host system via a remote direct memory access (RDMA) connection, wherein the metadata identifies one or more portions of the memory region, and wherein the RDMA connection was previously established between a network interface controller (NIC) of the failover host system and a NIC of the source host system before the failure; and for each of the one or more portions identified by the metadata, copying content of the portion from the memory region to the new memory region via the RDMA connection.
 9. The non-transitory computer readable storage medium of claim 8 wherein the method further comprises, prior to the allocating: receiving a virtual machine (VM) migrated from the source host system in response to the failure.
 10. The non-transitory computer readable storage medium of claim 9 wherein the allocating comprises: detecting that a virtual persistent memory module exists in a configuration file of the VM, the virtual persistent memory module being mapped to the memory region; and allocating the new memory region to have a same size as the virtual persistent memory module.
 11. The non-transitory computer readable storage medium of claim 10 wherein the method further comprises, after the copying: mapping the virtual persistent memory module to the new memory region.
 12. The non-transitory computer readable storage medium of claim 8 wherein the failure causes an operating system (OS) or hypervisor of the source host system to become inoperable.
 13. The non-transitory computer readable storage medium of claim 8 wherein the method further comprises, prior to retrieving the metadata: retrieving a baseline copy of the memory region from a storage backend shared by the source host system and the failover host system, the baseline copy representing a copy of the memory region as captured via a periodic flushing operation to the storage backend prior to the failure; and populating the new memory region with the baseline copy.
 14. The non-transitory computer readable storage medium of claim 13 wherein the one or more portions of the memory region identified by the metadata include data updates absent in the baseline copy.
 15. A host system comprising: a processor; a physical memory; a network interface controller (NIC); and a non-transitory computer readable medium having stored thereon program code that, when executed by the processor, causes the processor to, upon occurrence of a failure at another host system: allocate a new memory region in the physical memory that corresponds to a memory region in a physical memory of said another host system; retrieve metadata pertaining to the memory region from said another host system via a remote direct memory access (RDMA) connection, wherein the metadata identifies one or more portions of the memory region, and wherein the RDMA connection was previously established between the NIC of the host system and a NIC of said another host system before the failure; and for each of the one or more portions identified by the metadata, copy content of the portion from the memory region to the new memory region via the RDMA connection.
 16. The host system of claim 15 wherein the program code further causes the processor to, prior to the allocating: receive a virtual machine (VM) migrated from said another host system in response to the failure.
 17. The host system of claim 16 wherein the program code that causes the processor to allocate the new memory region comprises program code that causes the processor to: detect that a virtual persistent memory module exists in a configuration file of the VM, the virtual persistent memory module being mapped to the memory region; and allocate the new memory region to have a same size as the virtual persistent memory module.
 18. The host system of claim 17 wherein the program code further causes the processor to, after the copying: map the virtual persistent memory module to the new memory region.
 19. The host system of claim 15 wherein the failure causes an operating system (OS) or hypervisor of said another host system to become inoperable.
 20. The host system of claim 15 wherein the program code further causes the processor to, prior to retrieving the metadata: retrieve a baseline copy of the memory region from a storage backend shared by the host system and said another host system, the baseline copy representing a copy of the memory region as captured via a periodic flushing operation to the storage backend prior to the failure; and populate the new memory region with the baseline copy.
 21. The host system of claim 20 wherein the one or more portions of the memory region identified by the metadata include data updates absent in the baseline copy.